Chromosome-level genome assembly of Hippophae rhamnoides variety

Fructus hippophae (Hippophae rhamnoides ssp. mongolica × Hippophae rhamnoides ssp. sinensis), a hybrid variety of sea buckthorn in which Hippophae rhamnoides ssp. mongolica serves as the female parent and Hippophae rhamnoides ssp. sinensis serves as the male parent, is a traditional plant with great economic and medicinal potential. Herein, we obtained a chromosome-level genome of Fructus hippophae of about 918.59 Mb, with a scaffold N50 of 83.65 Mb. We anchored 440 contigs, representing 97.17% of the total genome sequence, onto 12 pseudochromosomes. De novo, homology-based and transcriptome-based strategies were then adopted for gene structure prediction, yielding 36475 protein-coding genes, of which 36226 could be functionally annotated. Several strategies were used for quality assessment; both the complete BUSCO value (98.80%) and the mapping rate indicated high assembly quality. Repetitive elements, which occupied 63.68% of the genome, and 1483600 bp of non-coding RNA were annotated. Here, we provide genomic information on female plants of a popular variety, which can provide data for pan-genome construction of sea buckthorn and for resolving the mechanism of sex differentiation.


Background & Summary
Sea buckthorn (Hippophae), belonging to the Elaeagnaceae family, is a diploid (2n = 2x = 24) deciduous plant with high exploitation value 1,2. Most sea buckthorn is cultivated in cold zones of Europe and Asia 3,4. Hippophae is rich in ascorbic acid, carotenoids, healthy fatty acids, and other secondary metabolites [5][6][7]. Previous studies have primarily focused on its medicinal value. Extracts from the leaves and orange-yellow fruit have immunomodulatory potential and antioxidant, anti-viral, and wound-healing properties [8][9][10][11]. Sea buckthorn is also used in traditional medicine for the treatment of pulmonary, cardiac, gastrointestinal, blood, or metabolic disorders [12][13][14][15][16]. It is therefore crucial to decode the genomic information of sea buckthorn. Three Hippophae genomes were published last year, namely those of Hippophae rhamnoides ssp. sinensis, Hippophae tibetana, and Hippophae gyantsensis, revealing differences in their biological data, such as genome size and the percentage of repeated sequences [17][18][19]. Decoding the genomes of other Hippophae subspecies and popular varieties is therefore important.
Rapid advances in sequencing technology have made it possible to obtain accurate, high-throughput data at a very low cost 20,21. However, no genomic information is currently available for Fructus hippophae. Studies on Fructus hippophae are currently limited to the compounds of Hippophae Fructus oil (HFO) and their related protein targets, relying on the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP: https://old.tcmsp-e.com/tcmsp.php) 22. Other studies have focused on methods for extracting and purifying flavonoids, tannins, and other novel nutritional supplements from Fructus hippophae, which depend on spectrophotometry, chromatography and other chemical methods [23][24][25]. Herein, we integrated three different sequencing datasets for genome assembly: short reads from next-generation sequencing (NGS) on the MGI platform, Oxford Nanopore Technologies (ONT) long reads, and high-throughput chromatin conformation capture (Hi-C) reads. Structural annotation of protein-coding genes was then carried out by de novo, homology-based and transcriptome-based strategies. Next, gene functional annotation was performed by alignment against public databases. These genome-related data will provide a valuable resource for the study of sea buckthorn.

Methods
Plant materials and genome sequencing.
To study the genome of Fructus hippophae, fresh young leaves were collected from the same wild Fructus hippophae tree, which was planted in Shigatse, Tibet, China. Total genomic DNA and RNA were extracted using the modified cetyl trimethylammonium bromide (CTAB) method and the E.Z.N.A. Total RNA Kit I (Omega Bio-Tek, Norcross, GA, USA), respectively 26. Then, 150 bp paired-end libraries with an insert size of 250 bp were constructed and sequenced on the MGISEQ-T7 platform. The Hi-C sequencing library was constructed according to the published protocol: the crosslinked chromatin was digested with DpnII and ligated after biotinylation. DNA fragments were enriched via the interaction between biotin and blunt-end ligation, and the enriched library was then sequenced on the MGISEQ-T7 platform. DNA long reads were generated by the Nanopore platform and processed by IsoSeq technology using the SMRT method. In total, whole-genome sequencing yielded 56 Gb of raw MGI short-read data (61× coverage) with a Q30 exceeding 90% (Table 1), 98 Gb of passed Nanopore long-read data (93× coverage, with an N50 length of 36264 bp and an average length of 25977 bp) (Table 2), and 79.09 Gb of Hi-C data (86× coverage) with a Q30 of 94.23% (Table 1).
Estimation of genome size. Sequence adaptors, duplications, and low-quality reads in the original paired-end short DNA reads were filtered by Fastp 27 with the parameters -n 0 -l 140. Then, 55 Gb of clean reads from the MGI library were used to estimate the size, heterozygosity, and repeat content of the genome using Jellyfish 28, with a 21-mer frequency and the parameter reads_cutoff = 1k, obtaining 47716688158 k-mers. Next, GenomeScope v2.0 29 was used to analyze the k-mer frequency distribution. Ultimately, the genome size was estimated to be 843 Mb, with 2.19% heterozygosity and 49.6% repetitive sequences (Fig. 1).
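The idea behind a k-mer-based size estimate can be illustrated with a minimal sketch. It uses the simple textbook relation (genome size ≈ total genomic k-mers / depth of the main peak), not the mixture model fitted by GenomeScope, and it assumes the histogram is supplied as a mapping from depth to the number of distinct k-mers at that depth. The function name and the toy histogram are hypothetical. For a highly heterozygous genome the tallest peak may be the heterozygous one, so the result should be read as a rough approximation only.

```python
def estimate_genome_size(histogram):
    """Rough genome size from a k-mer histogram {depth: distinct k-mer count}.

    Simplified peak-based estimate, not the GenomeScope model.
    Returns (estimated_size, peak_depth).
    """
    depths = sorted(histogram)
    # The first local minimum is taken as the boundary between
    # low-depth error k-mers and genuine genomic k-mers.
    valley = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if histogram[cur] > histogram[prev]:
            valley = prev
            break
    genomic = {d: c for d, c in histogram.items() if d >= valley}
    # Depth with the most distinct k-mers approximates sequencing depth.
    peak_depth = max(genomic, key=genomic.get)
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(d * c for d, c in genomic.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    # Hypothetical toy histogram: error peak at depth 1, main peak at depth 6.
    toy = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 300, 6: 900, 7: 300, 8: 80}
    size, peak = estimate_genome_size(toy)
    print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In practice the histogram would come from a k-mer counter's output file; parsing that file is left out here because its exact layout depends on the tool and options used.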
Genome de novo assembly. The genome was assembled by integrating the clean Nanopore long reads, MGI short reads and Hi-C reads. First, de novo genome assembly was performed by NextDenovo (v2.5.0) (https://github.com/Nextomics/NextDenovo) with the high-quality ONT reads. Then, the clean NGS reads were used for four rounds of self-correction and three rounds of consensus correction by Nextpolish (v1.4.1) 30 with the task parameter = best. Next, the redundant sequences resulting from heterozygosity were removed with the purge-dups (v1.2.5) 31 pipeline. After assembly, a 921.69 Mb draft genome comprising 723 contigs, with an N50 of 14.8 Mb, was obtained (Table 3). Additionally, Hicpro (v3.1.0) 32 was used to further validate the Hi-C reads, and 3D-DNA 33 was then used to organize and anchor the contigs into draft chromosomes. The clustering, ordering, and orientation of the draft assembly were manually checked and refined using Juicebox assembly tools 34. Ultimately, the final genome was 918.59 Mb in size and consisted of 253 scaffolds with an N50 length of 83.64 Mb, including 12 pseudochromosomes that accounted for 97.14% of the total genome length (Table 4). The Circos plot of the distribution of genomic elements (Fig. 2) was generated by shinyCircos v2.0 (https://venyao.xyz/shinyCircos/), and the heatmap of genome-wide Hi-C data (Fig. 3) for the Fructus hippophae chromosomes was drawn by hicexplorer.
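Because contig and scaffold N50 values are the main contiguity statistics reported here, a short sketch of how N50 is conventionally computed may be helpful. It assumes only a list of sequence lengths in bp; the function name and the example lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (bp), for illustration only.
    contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50 =", n50(contigs))
```

The same function applies to contigs and scaffolds alike; only the input list of lengths differs.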
Repetitive element identification. The transposable elements (TEs) in the genome were identified and annotated by Extensive de-novo TE Annotator (EDTA) v2.1.2 35 and classified by TEsorter (v1.3) 36, DeepTE 37, and LTR_FINDER 38. Finally, 913550 bp of repeat elements were predicted, occupying 61.02% of the total genome length. After annotation, the TEs could be classified into five categories: long terminal repeats (LTR), terminal inverted repeats (TIR), non-LTR, non-TIR, and others. Of these, Gypsy occupied the highest proportion (35.65%) and was evenly distributed across the 12 pseudochromosomes, followed by Copia, which occupied 19.81% and was highly abundant in the central region of the genome (Table 5, Fig. 2).
Protein-coding gene prediction. Repeatmasker (v4.1.2-p1) 39 software was used for repeat masking, and the masked genome was then subjected to gene prediction. First, the structure of the protein-coding genes was predicted using braker 40 and tsebra 41 software by integrating evidence from homology-, de novo- and transcriptome-based annotations. The Maker (v3.01.04) 42 and EVidenceModeler (v1.1.1) 43 pipelines were used to integrate the evidence into non-redundant gene models, and a GFF3 file giving the gene, coding sequence, protein, and mRNA positions was obtained. Finally, a total of 36475 protein-coding genes were predicted, with gene lengths of 158 to 127368 bp. Additionally, 35943 (98.54%) of the predicted genes were allocated to the 12 chromosomes, and the gene distribution showed a higher density at the ends of the chromosomes.
Gene function and non-coding RNA annotation. The predicted genes were functionally annotated by homology searches against public databases using BLASTP 44 with an e-value cutoff of 1e-10, including NR, Swissprot 45, Translated European Molecular Biology Laboratory (TrEMBL) 46, KOG, GO 47, KEGG 48 and COG. Overall, 99.31% of the genes were functionally annotated. Among them, 98.71%, 99.31%, 70.27%, 53.44%, 44.17%, 50.4%, and 37.83% of the genes were annotated in the NR, TrEMBL 46, Swissprot 45, KOG, KEGG 48, GO 47 and COG 49 databases, respectively (Table 6). Non-coding RNAs were identified using cmscan 50 search against the RNA families database (Rfam) 51
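The per-database annotation rates for the protein-coding genes can be understood as the fraction of predicted genes that have at least one hit passing the e-value cutoff. The sketch below illustrates that calculation under stated assumptions: the BLASTP results are taken to be in the standard 12-column tabular layout (query ID in column 1, e-value in column 11), and the function name, gene IDs and example hits are hypothetical. It is not the pipeline used in this study.

```python
def annotated_fraction(blast_lines, gene_ids, max_evalue=1e-10):
    """Fraction of genes with at least one hit at or below the e-value cutoff.

    blast_lines: iterable of tab-separated BLAST tabular lines
                 (assumed standard 12-column layout, e-value in column 11).
    gene_ids:    all predicted gene (query) identifiers.
    """
    genes = set(gene_ids)
    if not genes:
        raise ValueError("gene_ids must not be empty")
    annotated = set()
    for line in blast_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue and query in genes:
            annotated.add(query)
    return len(annotated) / len(genes)


if __name__ == "__main__":
    # Hypothetical hits: g1 passes the cutoff, g2 does not.
    hits = [
        "g1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
        "g2\tP2\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.001\t40",
    ]
    rate = annotated_fraction(hits, ["g1", "g2", "g3", "g4"])
    print(f"annotated: {rate:.2%}")
```

Counting each gene once, however many hits it has, is what makes the result a per-gene annotation rate rather than a hit count.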

Technical Validation
Several strategies were used to assess the genome quality. The completeness of the non-redundant draft genome was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) 68 with default parameters and the embryophyta_odb10 dataset, which consists of 1614 single-copy genes, revealing that 98.8% of these genes exhibited complete coverage. Among them, 87.2% were complete and only 1% were missing (Table 8). Additionally, coverage was estimated by mapping the NGS reads and ONT reads to the assembled genome with BWA-mem2 (v2.2) (https://github.com/bwa-mem2/bwa-mem2) and minimap2, respectively. The coverage was calculated by SAMtools 69, indicating that 93.9% of the DNA short reads mapped to the assembled genome. Furthermore, the clean RNA reads were aligned back to the draft genome using HISAT2, with 99.96% of the uniquely mapped transcriptome reads suggesting comprehensive genome coverage. Given the existence of published sea buckthorn genomes, we also compared the gene structure between F. hippophae and three other sea buckthorn species using JCVI. Blocks with a span lower than 10 were filtered out, revealing a strong collinearity relationship (Fig. 4). In summary, the combined results from BUSCO, mapping coverage, and collinearity analysis demonstrate the high quality of our F. hippophae genome.
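For readers unfamiliar with how BUSCO percentages relate to one another, the following sketch shows the usual bookkeeping: complete genes are the sum of single-copy and duplicated genes, and all categories are expressed as a share of the dataset size. The function name and the counts in the usage example are hypothetical placeholders and are not the values in Table 8.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the dataset size.

    Complete (C) = single-copy (S) + duplicated (D); the dataset size is
    the sum of complete, fragmented (F) and missing (M) genes.
    """
    total = single + duplicated + fragmented + missing
    if total <= 0:
        raise ValueError("counts must sum to a positive number")

    def pct(count):
        return 100.0 * count / total

    return {
        "C": pct(single + duplicated),
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


if __name__ == "__main__":
    # Hypothetical counts that sum to 1614 (size of embryophyta_odb10).
    summary = busco_summary(single=1400, duplicated=190, fragmented=8, missing=16)
    for key in ("C", "S", "D", "F", "M"):
        print(f"{key}: {summary[key]:.1f}%")
    print("n:", summary["n"])
```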

Fig. 2 Circos plot of the distribution of the Fructus hippophae genomic elements. The tracks indicate (A) length of chromosomes, (B) distribution of genes on different chromosomes, (C) distribution of transposable elements on different chromosomes, (D) distribution of copia elements on different chromosomes, (E) distribution of gypsy elements on different chromosomes, (F) GC content of different chromosomes. The densities of genes, TEs, copia elements, gypsy elements and GC were calculated in 500 kb windows.

Fig. 3 Heatmap of genome-wide Hi-C data of Fructus hippophae chromosomes. The frequency of Hi-C interaction links is represented by colors ranging from orange (low) to dark red (high).

Table 1. Characteristics of NGS data for genome assembly.

Table 2. Characteristics of ONT data for genome assembly.

Table 3. Characteristics of the Fructus hippophae genome at contig level.

Table 4. Characteristics of the Fructus hippophae genome at scaffold level.

Table 5. Summary of transposable elements in Fructus hippophae genome.

Table 6. Statistical analysis of the functional gene annotations of the Fructus hippophae genome.

Table 7. Classification of non-coding RNA in the Fructus hippophae genome.

Table 8. Statistics for genome assessment using BUSCO.